Abstract

In this paper we relate a number of parsing algorithms 
which have been developed in very different areas of 
parsing theory, and which include deterministic algo- 
rithms, tabular algorithms, and a parallel algorithm. 
We show that these algorithms are based on the same 
underlying ideas. 

By relating existing ideas, we hope to provide an op- 
portunity to improve some algorithms based on features 
of others. A second purpose of this paper is to answer a
question which has come up in the area of tabular parsing,
namely how to obtain a parsing algorithm with the
property that the table will contain as few entries as
possible, but without the possibility that two entries
represent the same subderivation.

Introduction 

Left-corner (LC) parsing is a parsing strategy which 
has been used in different guises in various areas of computer
science. Deterministic LC parsing with k symbols
of lookahead can handle the class of LC(k) grammars.
Since LC parsing is a very simple parsing technique and 
at the same time is able to deal with left recursion, it is 
often used as an alternative to top-down (TD) parsing, 
which cannot handle left recursion and is generally less 
efficient. 

Nondeterministic LC parsing is the foundation of a 
very efficient parsing algorithm [7], related to Tomita's 
algorithm and Earley's algorithm. It has one disad- 
vantage however, which becomes noticeable when the 
grammar contains many rules whose right-hand sides 
begin with the same few grammar symbols, e.g.

$A \rightarrow \alpha\beta_1 \mid \alpha\beta_2 \mid \cdots$

where $\alpha$ is not the empty string. After an LC parser
has recognized the first symbol $X$ of such an $\alpha$, it will
as its next step predict all aforementioned rules. This
amounts to much nondeterminism, which is detrimental 
both to the time-complexity and the space-complexity. 
A first attempt to solve this problem is to use predictive
LR (PLR) parsing. PLR parsing allows simultaneous
processing of a common prefix $\alpha$, provided that
the left-hand sides of the rules are the same. However,
in case we have e.g. the rules $A \rightarrow \alpha\beta_1$ and $B \rightarrow \alpha\beta_2$,
where again $\alpha$ is not the empty string but now $A \neq B$,
then PLR parsing will not improve the efficiency. We
therefore go one step further and discuss extended LR 
(ELR) and common-prefix (CP) parsing, which are al- 
gorithms capable of simultaneous processing of all com- 
mon prefixes. ELR and CP parsing are the foundation 
of tabular parsing algorithms and a parallel parsing al- 
gorithm from the existing literature, but they have not 
been described in their own right. 

To the best of the author's knowledge, the various 
parsing algorithms mentioned above have not been dis- 
cussed together in the existing literature. The main 
purpose of this paper is to make explicit the connec- 
tions between these algorithms. 

A second purpose of this paper is to show that CP 
and ELR parsing are obvious solutions to a problem of 
tabular parsing which can be described as follows. For 
each parsing algorithm working on a stack there is a 
realisation using a parse table, where the parse table 
allows sharing of computation between different search 
paths. For example, Tomita's algorithm [18] can be seen 
as a tabular realisation of nondeterministic LR parsing. 

At this point we use the term state to indicate the 
symbols occurring on the stack of the original algo- 
rithm, which also occur as entries in the parse table 
of its tabular realisation. 

In general, powerful algorithms working on a stack 
lead to efficient tabular parsing algorithms, provided 
the grammar can be handled almost deterministically. 
In case the stack algorithm is very nondeterministic for 
a certain grammar however, sophistication which in- 
creases the number of states may lead to an increasing 
number of entries in the parse table of the tabular realization.
This can be informally explained by the fact
that each state represents the computation of a number
of subderivations. If the number of states is increased
then it is inevitable that at some point some states
represent an overlapping collection of subderivations,
which may lead to work being repeated during parsing.
Furthermore, the parse forest (a compact representa- 
tion of all parse trees) which is output by a tabular 
algorithm may in this case not be optimally dense. 

We conclude that we have a tradeoff between the case 
that the grammar allows almost deterministic parsing 
and the case that the stack algorithm is very nondeterministic
for a certain grammar. In the former case, sophistication
leads to fewer entries in the table, and in the
latter case, sophistication leads to more entries, provided
this sophistication is realised by an increase in
the number of states. This is corroborated by empirical 
data from [1, 4], which deal with tabular LR parsing. 

As we will explain, CP and ELR parsing are more 
deterministic than most other parsing algorithms for 
many grammars, but their tabular realizations can 
never compute the same subderivation twice. This rep- 
resents an optimum in a range of possible parsing algo- 
rithms. 

This paper is organized as follows. First we discuss 
nondeterministic left-corner parsing, and demonstrate 
how common prefixes in a grammar may be a source of 
bad performance for this technique. 

Then, a multitude of parsing techniques which ex- 
hibit better treatment of common prefixes is dis- 
cussed. These techniques, including nondeterministic 
PLR, ELR, and CP parsing, have their origins in theory 
of deterministic, parallel, and tabular parsing. Subse- 
quently, the application to parallel and tabular parsing 
is investigated more closely. 

Further, we briefly describe how rules with empty 
right-hand sides complicate the parsing process. 

The ideas described in this paper can be generalized 
to head-driven parsing, as argued in [9]. 

We will take some liberty in describing algorithms 
from the existing literature, since using the original de- 
scriptions would blur the similarities of the algorithms 
to one another. In particular, we will not treat the use 
of lookahead, and we will consider all algorithms work- 
ing on a stack to be nondeterministic. We will only 
describe recognition algorithms. Each of the algorithms 
can however be easily extended to yield parse trees as 
a side-effect of recognition. 

The notation used in the sequel is for the most part 
standard and is summarised below. 

A context-free grammar $G = (T, N, P, S)$ consists of
two finite disjoint sets $N$ and $T$ of nonterminals and
terminals, respectively, a start symbol $S \in N$, and a
finite set of rules $P$. Every rule has the form $A \rightarrow \alpha$,
where the left-hand side (lhs) $A$ is an element from $N$
and the right-hand side (rhs) $\alpha$ is an element from $V^*$,
where $V$ denotes $(N \cup T)$. $P$ can also be seen as a
relation on $N \times V^*$.

We use symbols $A, B, C, \ldots$ to range over $N$, symbols
$a, b, c, \ldots$ to range over $T$, symbols $X, Y, Z$ to range over
$V$, symbols $\alpha, \beta, \gamma, \ldots$ to range over $V^*$, and $v, w, x, \ldots$
to range over $T^*$. We let $\epsilon$ denote the empty string. The
notation of rules $A \rightarrow \alpha_1, A \rightarrow \alpha_2, \ldots$ with the same
lhs is often simplified to $A \rightarrow \alpha_1 \mid \alpha_2 \mid \cdots$.

A rule of the form $A \rightarrow \epsilon$ is called an epsilon rule.
We assume grammars do not have epsilon rules unless 
stated otherwise. 

The relation $P$ is extended to a relation $\rightarrow$ on $V^* \times V^*$
as usual. The reflexive and transitive closure of $\rightarrow$ is
denoted by $\rightarrow^*$.

We define: $B \angle A$ if and only if $A \rightarrow B\alpha$ for some $\alpha$.
The reflexive and transitive closure of $\angle$ is denoted by
$\angle^*$, and is called the left-corner relation.

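We say two rules $A \rightarrow \alpha_1$ and $B \rightarrow \alpha_2$ have a common
prefix $\beta$ if $\alpha_1 = \beta\gamma_1$ and $\alpha_2 = \beta\gamma_2$, for some $\gamma_1$
and $\gamma_2$, where $\beta \neq \epsilon$.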
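
The definition of a common prefix can be illustrated with a small sketch. The representation of a rule as a pair (lhs, rhs) with the rhs a tuple of symbols, and the function names, are assumptions made for illustration only.

```python
from typing import Tuple

# A rule is a pair (lhs, rhs), where rhs is a tuple of grammar symbols.
Rule = Tuple[str, Tuple[str, ...]]


def longest_common_prefix(rule1: Rule, rule2: Rule) -> Tuple[str, ...]:
    """Return the longest string beta that starts both right-hand sides."""
    prefix = []
    for x, y in zip(rule1[1], rule2[1]):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def have_common_prefix(rule1: Rule, rule2: Rule) -> bool:
    """True if the two rules share some common prefix beta with beta != epsilon."""
    return len(longest_common_prefix(rule1, rule2)) > 0
```

Note that the left-hand sides play no role in the test, in line with the definition, which allows the two rules to have different left-hand sides.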

A recognition algorithm can be specified by means
of a push-down automaton $\mathcal{A} = (T, \mathit{Alph}, \mathit{Init}, \vdash, \mathit{Fin})$,
which manipulates configurations of the form $(\Gamma, v)$,
where $\Gamma \in \mathit{Alph}^*$ is the stack, constructed from left
to right, and $v \in T^*$ is the remaining input.

The initial configuration is $(\mathit{Init}, w)$, where $\mathit{Init} \in \mathit{Alph}$
is a distinguished stack symbol, and $w$ is the input.
The steps of an automaton are specified by means of the
relation $\vdash$. Thus, $(\Gamma, v) \vdash (\Gamma', v')$ denotes that $(\Gamma', v')$
is obtainable from $(\Gamma, v)$ by one step of the automaton.
The reflexive and transitive closure of $\vdash$ is denoted by
$\vdash^*$. The input $w$ is accepted if $(\mathit{Init}, w) \vdash^* (\mathit{Fin}, \epsilon)$,
where $\mathit{Fin} \in \mathit{Alph}$ is a distinguished stack symbol.

LC parsing 

For the definition of left-corner (LC) recognition [7] we
need stack symbols (items) of the form $[A \rightarrow \alpha \bullet \beta]$,
where $A \rightarrow \alpha\beta$ is a rule, and $\alpha \neq \epsilon$. (Remember that
we do not allow epsilon rules.) The informal meaning
of an item is "The part before the dot has just been
recognized, the first symbol after the dot is to be recognized
next". For technical reasons we also need the
items $[S' \rightarrow \bullet S]$ and $[S' \rightarrow S \bullet]$, where $S'$ is a fresh
symbol. Formally:

$I^{LC} = \{[A \rightarrow \alpha \bullet \beta] \mid A \rightarrow \alpha\beta \in P^\dagger \wedge (\alpha \neq \epsilon \vee A = S')\}$

where $P^\dagger$ represents the augmented set of rules, consisting
of the rules in $P$ plus the extra rule $S' \rightarrow S$.

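Algorithm 1 (Left-corner)

$\mathcal{A}^{LC} = (T, I^{LC}, \mathit{Init}, \vdash, \mathit{Fin})$, $\mathit{Init} = [S' \rightarrow \bullet S]$, $\mathit{Fin} =
[S' \rightarrow S \bullet]$. Transitions are allowed according to the
following clauses.

1. $(\Gamma[B \rightarrow \beta \bullet C\gamma], av) \vdash (\Gamma[B \rightarrow \beta \bullet C\gamma][A \rightarrow a \bullet \alpha], v)$
where there is $A \rightarrow a\alpha \in P^\dagger$ such that $A \angle^* C$

2. $(\Gamma[A \rightarrow \alpha \bullet a\beta], av) \vdash (\Gamma[A \rightarrow \alpha a \bullet \beta], v)$

3. $(\Gamma[B \rightarrow \beta \bullet C\gamma][A \rightarrow \alpha \bullet], v) \vdash (\Gamma[B \rightarrow \beta \bullet C\gamma][D \rightarrow A \bullet \delta], v)$
where there is $D \rightarrow A\delta \in P^\dagger$ such that $D \angle^* C$

4. $(\Gamma[B \rightarrow \beta \bullet A\gamma][A \rightarrow \alpha \bullet], v) \vdash (\Gamma[B \rightarrow \beta A \bullet \gamma], v)$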

The conditions using the left-corner relation $\angle^*$ in the
first and third clauses together form a feature which is
called top-down (TD) filtering. TD filtering makes sure
that subderivations that are being computed bottom-up
may eventually grow into subderivations with the required
root. TD filtering is not necessary for a correct
algorithm, but it reduces nondeterminism, and guarantees
the correct-prefix property, which means that in
case of incorrect input the parser does not read past the
first incorrect character.
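
The four clauses of Algorithm 1 can be tried out with the following minimal sketch, which explores the configurations of the automaton by depth-first search. The item encoding (lhs, rhs, dot position), the function names, the small expression grammar and the use of "^" as an ASCII stand-in for the up-arrow operator are all assumptions of this sketch; epsilon rules are assumed to be absent.

```python
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]
Config = Tuple[tuple, int]  # (stack of items, input position)

GRAMMAR: List[Rule] = [
    ("E", ("E", "+", "T")), ("E", ("T", "^", "E")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "**", "F")), ("T", ("F",)),
    ("F", ("a",)),
]


def left_corner_star(rules: Sequence[Rule]) -> Set[Tuple[str, str]]:
    """Pairs (B, A) of nonterminals such that B is a left corner* of A."""
    nts = {lhs for lhs, _ in rules}
    rel = {(n, n) for n in nts}
    rel |= {(rhs[0], lhs) for lhs, rhs in rules if rhs[0] in nts}
    changed = True
    while changed:
        changed = False
        for c, b in list(rel):
            for b2, a in list(rel):
                if b == b2 and (c, a) not in rel:
                    rel.add((c, a))
                    changed = True
    return rel


def lc_recognize(rules: Sequence[Rule], start: str, tokens: Sequence[str]) -> bool:
    aug = list(rules) + [("S'", (start,))]        # augmented rule set
    nts = {lhs for lhs, _ in aug}
    lc = left_corner_star(aug)
    init, fin = ("S'", (start,), 0), ("S'", (start,), 1)

    def successors(stack: tuple, pos: int) -> List[Config]:
        result = []
        lhs, rhs, dot = stack[-1]
        if dot < len(rhs):
            if pos < len(tokens):
                a, nxt = tokens[pos], rhs[dot]
                if nxt in nts:                    # clause 1: push a new item
                    for A, r in aug:
                        if r[0] == a and (A, nxt) in lc:
                            result.append((stack + ((A, r, 1),), pos + 1))
                elif nxt == a:                    # clause 2: shift a terminal
                    result.append((stack[:-1] + ((lhs, rhs, dot + 1),), pos + 1))
        elif len(stack) >= 2:                     # top item is complete
            b_lhs, b_rhs, b_dot = stack[-2]
            expected = b_rhs[b_dot]
            for D, r in aug:                      # clause 3: climb to a parent rule
                if r[0] == lhs and (D, expected) in lc:
                    result.append((stack[:-1] + ((D, r, 1),), pos))
            if expected == lhs:                   # clause 4: pop and advance
                result.append((stack[:-2] + ((b_lhs, b_rhs, b_dot + 1),), pos))
        return result

    agenda: List[Config] = [((init,), 0)]
    seen = set(agenda)
    while agenda:
        stack, pos = agenda.pop()
        if stack == (fin,) and pos == len(tokens):
            return True
        for config in successors(stack, pos):
            if config not in seen:
                seen.add(config)
                agenda.append(config)
    return False
```

The set of visited configurations is only there to make the search terminate; the automaton itself is nondeterministic and prescribes no search strategy.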

Example 1 Consider the grammar with the following
rules:

$E \rightarrow E + T \mid T \uparrow E \mid T$

$T \rightarrow T * F \mid T ** F \mid F$

$F \rightarrow a$

It is easy to see that $E \angle E$, $T \angle E$, $T \angle T$, $F \angle T$.
The relation $\angle^*$ contains $\angle$ but from the reflexive closure
it also contains $F \angle^* F$ and from the transitive closure
it also contains $F \angle^* E$.

The recognition of a * a is realised by: 




Note that since the automaton does not use any lookahead,
Step 3 may also have replaced $[T \rightarrow F \bullet]$ by
any other item besides $[T \rightarrow T \bullet * F]$ whose rhs starts
with $T$ and whose lhs satisfies the condition of top-down
filtering with regard to $E$, i.e. by $[T \rightarrow T \bullet ** F]$,
$[E \rightarrow T \bullet \uparrow E]$, or $[E \rightarrow T \bullet]$. □
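
The left-corner relation of this example grammar can be computed mechanically. The sketch below is an illustration under our own conventions: a pair (B, A) stands for "B is a left corner of A", rules are pairs (lhs, rhs), and "^" is an ASCII stand-in for the up-arrow operator.

```python
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]

GRAMMAR: List[Rule] = [
    ("E", ("E", "+", "T")), ("E", ("T", "^", "E")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "**", "F")), ("T", ("F",)),
    ("F", ("a",)),
]


def left_corner(rules: Sequence[Rule]) -> Set[Tuple[str, str]]:
    """Pairs (B, A) such that there is a rule A -> B alpha, with B a nonterminal."""
    nts = {lhs for lhs, _ in rules}
    return {(rhs[0], lhs) for lhs, rhs in rules if rhs and rhs[0] in nts}


def left_corner_star(rules: Sequence[Rule]) -> Set[Tuple[str, str]]:
    """Reflexive and transitive closure of the left-corner relation."""
    nts = {lhs for lhs, _ in rules}
    rel = left_corner(rules) | {(n, n) for n in nts}
    changed = True
    while changed:
        changed = False
        for c, b in list(rel):
            for b2, a in list(rel):
                # (c, b) and (b, a) together give (c, a)
                if b == b2 and (c, a) not in rel:
                    rel.add((c, a))
                    changed = True
    return rel
```

For this grammar the closure adds exactly two pairs to the basic relation: (F, F) from reflexivity and (F, E) from transitivity.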

LC parsing with k symbols of lookahead can handle
deterministically the so-called LC(k) grammars. This
class of grammars is formalized in [13].¹ How LC parsing
can be improved to handle common suffixes efficiently
is discussed in [6]; in this paper we restrict our
attention to common prefixes.

PLR, ELR, and CP parsing 

In this section we investigate a number of algorithms 
which exhibit a better treatment of common prefixes. 

Predictive LR parsing 

Predictive LR (PLR) parsing with k symbols of lookahead
was introduced in [17] as an algorithm which yields
efficient parsers for a subset of the LR(k) grammars [16]
and a superset of the LC(k) grammars. How deterministic
PLR parsing succeeds in handling a larger class
of grammars (the PLR(k) grammars) than the LC(k)
grammars can be explained by identifying PLR parsing
for some grammar $G$ with LC parsing for some grammar
$G'$ which results after applying a transformation
called left-factoring.

Left-factoring consists of replacing two or more rules
$A \rightarrow \alpha\beta_1 \mid \alpha\beta_2 \mid \cdots$ with a common prefix $\alpha$ by the rules
$A \rightarrow \alpha A'$ and $A' \rightarrow \beta_1 \mid \beta_2 \mid \cdots$, where $A'$ is a fresh nonterminal.
The effect on LC parsing is that a choice
between rules is postponed until after all symbols of $\alpha$
are completely recognized. Investigation of the next k
symbols of the remaining input may then allow a choice
between the rules to be made deterministically.
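
A single left-factoring step can be sketched as follows. The function name and the way the fresh nonterminal is supplied by the caller are assumptions of this illustration; the sketch factors out the longest prefix shared by all given alternatives of one left-hand side.

```python
from typing import List, Sequence, Tuple

Rule = Tuple[str, Tuple[str, ...]]


def left_factor(lhs: str, alternatives: Sequence[Sequence[str]], fresh: str) -> List[Rule]:
    """Replace A -> alpha beta_1 | alpha beta_2 | ... by
    A -> alpha A' and A' -> beta_1 | beta_2 | ..., with A' given by `fresh`."""
    alts = [tuple(alt) for alt in alternatives]
    prefix = alts[0] if alts else ()
    for alt in alts[1:]:
        k = 0
        while k < min(len(prefix), len(alt)) and prefix[k] == alt[k]:
            k += 1
        prefix = prefix[:k]
    if len(alts) < 2 or not prefix:
        # Nothing to factor: return the rules unchanged.
        return [(lhs, alt) for alt in alts]
    factored = [(lhs, prefix + (fresh,))]
    factored += [(fresh, alt[len(prefix):]) for alt in alts]
    return factored
```

If one alternative is itself equal to the common prefix, the sketch produces an epsilon rule for the fresh nonterminal, which falls outside the grammars without epsilon rules assumed in most of this paper.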

The PLR algorithm is formalised in [17] by transforming
a PLR(k) grammar into an LL(k) grammar
and then assuming the standard realisation of LL(k)
parsing. When we consider nondeterministic top-down
parsing instead of LL(k) parsing, then we obtain the
new formulation of nondeterministic PLR(0) parsing
below.

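We first need to define another kind of item, viz. of
the form $[A \rightarrow \alpha]$ such that there is at least one rule of
the form $A \rightarrow \alpha\beta$ for some $\beta$. Formally:

$I^{PLR} = \{[A \rightarrow \alpha] \mid A \rightarrow \alpha\beta \in P^\dagger \wedge (\alpha \neq \epsilon \vee A = S')\}$

Informally, an item $[A \rightarrow \alpha]$ represents one or
more items $[A \rightarrow \alpha \bullet \beta]$.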

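Algorithm 2 (Predictive LR)

$\mathcal{A}^{PLR} = (T, I^{PLR}, \mathit{Init}, \vdash, \mathit{Fin})$, $\mathit{Init} = [S' \rightarrow\ ]$, $\mathit{Fin} =
[S' \rightarrow S]$, and $\vdash$ defined by:

1. $(\Gamma[B \rightarrow \beta], av) \vdash (\Gamma[B \rightarrow \beta][A \rightarrow a], v)$
where there are $A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in P^\dagger$ such that
$A \angle^* C$

2. $(\Gamma[A \rightarrow \alpha], av) \vdash (\Gamma[A \rightarrow \alpha a], v)$
where there is $A \rightarrow \alpha a\beta \in P^\dagger$

3. $(\Gamma[B \rightarrow \beta][A \rightarrow \alpha], v) \vdash (\Gamma[B \rightarrow \beta][D \rightarrow A], v)$
where $A \rightarrow \alpha \in P^\dagger$ and where there are $D \rightarrow
A\delta, B \rightarrow \beta C\gamma \in P^\dagger$ such that $D \angle^* C$

4. $(\Gamma[B \rightarrow \beta][A \rightarrow \alpha], v) \vdash (\Gamma[B \rightarrow \beta A], v)$
where $A \rightarrow \alpha \in P^\dagger$ and where there is $B \rightarrow \beta A\gamma \in
P^\dagger$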

Example 2 Consider the grammar from Example 1. 
Using Predictive LR, recognition of a * a is realised by: 




1 In [17] a different definition of the LC(k) grammars may
be found, which is not completely equivalent.



Comparing these configurations with those reached by
the LC recognizer, we see that here after Step 3 the
stack element $[T \rightarrow T]$ represents both $[T \rightarrow T \bullet * F]$
and $[T \rightarrow T \bullet ** F]$, so that nondeterminism is reduced.
Still some nondeterminism remains, since Step 3 could
also have replaced $[T \rightarrow F]$ by $[E \rightarrow T]$, which represents
both $[E \rightarrow T \bullet \uparrow E]$ and $[E \rightarrow T \bullet]$. □



Extended LR parsing 

An extended context-free grammar has right-hand sides 
consisting of arbitrary regular expressions over V . This 
requires an LR parser for an extended grammar (an 
ELR parser) to behave differently from normal LR 
parsers. 

The behaviour of a normal LR parser upon a reduction
with some rule $A \rightarrow \alpha$ is very simple: it pops $|\alpha|$
states from the stack, revealing, say, state $Q$; it then
pushes state $goto(Q, A)$. (We identify a state with its
corresponding set of items.)

For extended grammars the behaviour upon a reduc- 
tion cannot be realised in this way since the regular 
expression of which the rhs is composed may describe 
strings of various lengths, so that it is unknown how 
many states need to be popped. 

In [11] this problem is solved by forcing the parser to 
decide at each call goto(Q, X) whether 

a) X is one more symbol of an item in Q of which some 
symbols have already been recognized, or whether 

b) X is the first symbol of an item which has been 
introduced in Q by means of the closure function. 

In the second case, a state which is a variant of 
goto(Q, X) is pushed on top of state Q as usual. In 
the first case, however, state Q on top of the stack is 
replaced by a variant of goto(Q, X). This is safe since 
we will never need to return to Q if after some more 
steps we succeed in recognizing some rule correspond- 
ing with one of the items in Q. A consequence of the 
action in the first case is that upon reduction we need 
to pop only one state off the stack. 

Further work in this area is reported in [5], which 
treats nondeterministic ELR parsing and therefore does 
not regard it as an obstacle if a choice between cases a) 
and b) cannot be uniquely made. 

We are not concerned with extended context-free 
grammars in this paper. However, a very interesting 
algorithm results from ELR parsing if we restrict its ap- 
plication to ordinary context-free grammars. (We will 
maintain the name "extended LR" to stress the origin 
of the algorithm.) This results in the new nondeterministic
ELR(0) algorithm that we describe below, derived
from the formulation of ELR parsing in [5].

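First, we define a set of items as

$I = \{[A \rightarrow \alpha \bullet \beta] \mid A \rightarrow \alpha\beta \in P^\dagger\}$

Note that $I^{LC} \subset I$. If we define for each $Q \subseteq I$:

$closure(Q) = Q \cup \{[A \rightarrow \bullet \alpha] \mid [B \rightarrow \beta \bullet C\gamma] \in Q \wedge A \angle^* C\}$

then the goto function for LR(0) parsing is defined by

$goto(Q, X) = closure(\{[A \rightarrow \alpha X \bullet \beta] \mid [A \rightarrow \alpha \bullet X\beta] \in Q\})$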

For ELR parsing however, we need two goto functions,
$goto_1$ and $goto_2$, one for kernel items (i.e. those
in $I^{LC}$) and one for nonkernel items (the others). These
are defined by

$goto_1(Q, X) = closure(\{[A \rightarrow \alpha X \bullet \beta] \mid [A \rightarrow \alpha \bullet X\beta] \in Q \wedge (\alpha \neq \epsilon \vee A = S')\})$

$goto_2(Q, X) = closure(\{[A \rightarrow X \bullet \beta] \mid [A \rightarrow \bullet X\beta] \in Q \wedge A \neq S'\})$

At each shift (where $X$ is some terminal) and each reduce
with some rule $A \rightarrow \alpha$ (where $X$ is $A$) we may nondeterministically
apply $goto_1$, which corresponds with
case a), or $goto_2$, which corresponds with case b). Of
course, one or both may not be defined on $Q$ and $X$,
because $goto_i(Q, X)$ may be $\emptyset$, for $i \in \{1, 2\}$.
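
The closure function and the three goto functions can be written down almost literally. The following is a minimal sketch under our own encoding: an item is a triple (lhs, rhs, dot position), RULES is the augmented rule set of the expression grammar of Example 1, and "^" is an ASCII stand-in for the up-arrow operator.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Item = Tuple[str, Tuple[str, ...], int]  # (lhs, rhs, dot position)

RULES = [
    ("S'", ("E",)),
    ("E", ("E", "+", "T")), ("E", ("T", "^", "E")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "**", "F")), ("T", ("F",)),
    ("F", ("a",)),
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def left_corner_star() -> Set[Tuple[str, str]]:
    """Pairs (B, A) such that B is a left corner* of A."""
    rel = {(n, n) for n in NONTERMINALS}
    rel |= {(rhs[0], lhs) for lhs, rhs in RULES if rhs[0] in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for c, b in list(rel):
            for b2, a in list(rel):
                if b == b2 and (c, a) not in rel:
                    rel.add((c, a))
                    changed = True
    return rel


LC_STAR = left_corner_star()


def closure(items: Iterable[Item]) -> FrozenSet[Item]:
    """Add [A -> . alpha] for every A that is a left corner* of an expected C."""
    items = set(items)
    result = set(items)
    for _, rhs, dot in items:
        if dot < len(rhs):
            for a, alpha in RULES:
                if (a, rhs[dot]) in LC_STAR:
                    result.add((a, alpha, 0))
    return frozenset(result)


def goto(q: Iterable[Item], x: str) -> FrozenSet[Item]:
    """The ordinary LR(0) goto function."""
    return closure({(a, rhs, dot + 1) for a, rhs, dot in q
                    if dot < len(rhs) and rhs[dot] == x})


def goto_1(q: Iterable[Item], x: str) -> FrozenSet[Item]:
    """Advance the dot in kernel items only (case a)."""
    return closure({(a, rhs, dot + 1) for a, rhs, dot in q
                    if dot < len(rhs) and rhs[dot] == x and (dot > 0 or a == "S'")})


def goto_2(q: Iterable[Item], x: str) -> FrozenSet[Item]:
    """Advance the dot in nonkernel items only (case b)."""
    return closure({(a, rhs, 1) for a, rhs, dot in q
                    if dot == 0 and rhs[0] == x and a != "S'"})
```

In this sketch the two new functions split the work of the ordinary goto function: for every set of items and every symbol, the union of the two results equals the result of goto, and either part may be empty.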

Now remark that when using $goto_1$ and $goto_2$, each
reachable set of items contains only items of the form
$A \rightarrow \alpha \bullet \beta$, for some fixed string $\alpha$, plus some nonkernel
items. We will ignore the nonkernel items since they
can be derived from the kernel items by means of the
closure function.

This suggests representing each set of items by a new
kind of item of the form $[\{A_1, A_2, \ldots, A_n\} \rightarrow \alpha]$, which
represents all items $A \rightarrow \alpha \bullet \beta$ for some $\beta$ and $A \in
\{A_1, A_2, \ldots, A_n\}$. Formally:

$I^{ELR} = \{[\Delta \rightarrow \alpha] \mid \emptyset \subset \Delta \subseteq \{A \mid A \rightarrow \alpha\beta \in P^\dagger\} \wedge (\alpha \neq \epsilon \vee \Delta = \{S'\})\}$

where we use the symbol $\Delta$ to range over sets of nonterminals.

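Algorithm 3 (Extended LR)

$\mathcal{A}^{ELR} = (T, I^{ELR}, \mathit{Init}, \vdash, \mathit{Fin})$, $\mathit{Init} = [\{S'\} \rightarrow\ ]$, $\mathit{Fin} =
[\{S'\} \rightarrow S]$, and $\vdash$ defined by:

1. $(\Gamma[\Delta \rightarrow \beta], av) \vdash (\Gamma[\Delta \rightarrow \beta][\Delta' \rightarrow a], v)$
where $\Delta' = \{A \mid \exists A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in P^\dagger [B \in
\Delta \wedge A \angle^* C]\}$ is non-empty

2. $(\Gamma[\Delta \rightarrow \alpha], av) \vdash (\Gamma[\Delta' \rightarrow \alpha a], v)$
where $\Delta' = \{A \in \Delta \mid A \rightarrow \alpha a\beta \in P^\dagger\}$ is non-empty

3. $(\Gamma[\Delta \rightarrow \beta][\Delta' \rightarrow \alpha], v) \vdash (\Gamma[\Delta \rightarrow \beta][\Delta'' \rightarrow A], v)$
where there is $A \rightarrow \alpha \in P^\dagger$ with $A \in \Delta'$, and $\Delta'' =
\{D \mid \exists D \rightarrow A\delta, B \rightarrow \beta C\gamma \in P^\dagger [B \in \Delta \wedge D \angle^* C]\}$ is
non-empty

4. $(\Gamma[\Delta \rightarrow \beta][\Delta' \rightarrow \alpha], v) \vdash (\Gamma[\Delta'' \rightarrow \beta A], v)$
where there is $A \rightarrow \alpha \in P^\dagger$ with $A \in \Delta'$, and $\Delta'' =
\{B \in \Delta \mid B \rightarrow \beta A\gamma \in P^\dagger\}$ is non-empty

Note that Clauses 1 and 3 correspond with $goto_2$ and
that Clauses 2 and 4 correspond with $goto_1$.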

Example 3 Consider again the grammar from Exam- 
ple 1. Using the ELR algorithm, recognition of a * a is 
realised by: 

Comparing these configurations with those reached by
the PLR recognizer, we see that here after Step 3 the
stack element $[\{T, E\} \rightarrow T]$ represents both $[T \rightarrow T \bullet * F]$
and $[T \rightarrow T \bullet ** F]$, but also $[E \rightarrow T \bullet]$ and
$[E \rightarrow T \bullet \uparrow E]$, so that nondeterminism is even further
reduced. □

A simplified ELR algorithm, which we call the pseudo
ELR algorithm, results from avoiding reference to $\Delta$ in
Clauses 1 and 3. In Clause 1 we then have a simplified
definition of $\Delta'$, viz. $\Delta' = \{A \mid \exists A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in
P^\dagger [A \angle^* C]\}$, and in the same way we have in Clause 3
the new definition $\Delta'' = \{D \mid \exists D \rightarrow A\delta, B \rightarrow \beta C\gamma \in
P^\dagger [D \angle^* C]\}$. Pseudo ELR parsing can be more easily
realised than full ELR parsing, but the correct-prefix
property can no longer be guaranteed. Pseudo ELR
parsing is the foundation of a tabular algorithm in [20].

Common-prefix parsing 

One of the more complicated aspects of the ELR algo- 
rithm is the treatment of the sets of nonterminals in 
the left-hand sides of items. A drastically simplified
algorithm is the basis of a tabular algorithm in [21].
Since in [21] the algorithm itself is not described but
only its tabular realisation,² we take the liberty of giving
this algorithm our own name: common-prefix (CP)
parsing, since it treats all rules with a common prefix
simultaneously.³

The simplification consists of omitting the sets of
nonterminals in the left-hand sides of items:

$I^{CP} = \{[\rightarrow \alpha] \mid A \rightarrow \alpha\beta \in P^\dagger\}$

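Algorithm 4 (Common-prefix)

$\mathcal{A}^{CP} = (T, I^{CP}, \mathit{Init}, \vdash, \mathit{Fin})$, $\mathit{Init} = [\rightarrow\ ]$, $\mathit{Fin} = [\rightarrow S]$,
and $\vdash$ defined by:

1. $(\Gamma[\rightarrow \beta], av) \vdash (\Gamma[\rightarrow \beta][\rightarrow a], v)$
where there are $A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in P^\dagger$ such that
$A \angle^* C$

2. $(\Gamma[\rightarrow \alpha], av) \vdash (\Gamma[\rightarrow \alpha a], v)$
where there is $A \rightarrow \alpha a\beta \in P^\dagger$

3. $(\Gamma[\rightarrow \beta][\rightarrow \alpha], v) \vdash (\Gamma[\rightarrow \beta][\rightarrow A], v)$
where there are $A \rightarrow \alpha, D \rightarrow A\delta, B \rightarrow \beta C\gamma \in P^\dagger$
such that $D \angle^* C$

4. $(\Gamma[\rightarrow \beta][\rightarrow \alpha], v) \vdash (\Gamma[\rightarrow \beta A], v)$
where there are $A \rightarrow \alpha, B \rightarrow \beta A\gamma \in P^\dagger$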

The simplification which leads to the CP algorithm 
inevitably causes the correct-prefix property to be lost. 

Example 4 Consider again the grammar from Example 1.
It is clear that $a + a \uparrow a$ is not a correct string
according to this grammar. The CP algorithm may go
through the following sequence of configurations:

2 An attempt has been made in [19] but this paper does 
not describe the algorithm in its full generality. 

3 The original algorithm in [21] applies an optimization 
concerning unit rules, irrelevant to our discussion. 

We see that in Step 9 the first incorrect symbol $\uparrow$ is read,
but recognition then continues. Eventually, the recognition
process is blocked in some unsuccessful configuration,
which is guaranteed to happen for any incorrect
input.⁴ In general however, after reading the first incorrect
symbol, the algorithm may perform an unbounded
number of steps before it halts. (Imagine what happens
for input of the form $a + a \uparrow a + a + a + \cdots + a$.) □

Tabular parsing 

Nondeterministic push-down automata can be realised
efficiently using parse tables [1]. A parse table consists
of sets $T_{i,j}$ of items, for $0 \leq i \leq j \leq n$, where $a_1 \ldots a_n$
represents the input. The idea is that an item is only
stored in a set $T_{i,j}$ if the item represents recognition of
the part of the input $a_{i+1} \ldots a_j$.

We will first discuss a tabular form of CP parsing, 
since this is the most simple parsing technique discussed 
above. We will then move on to the more difficult ELR 
technique. Tabular PLR parsing is fairly straightfor- 
ward and will not be discussed in this paper. 

Tabular CP parsing 

CP parsing has the following tabular realization: 

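Algorithm 5 (Tabular common-prefix)

Sets $T_{i,j}$ of the table are to be subsets of $I^{CP}$. Start
with an empty table. Add $[\rightarrow\ ]$ to $T_{0,0}$. Perform one of
the following steps until no more items can be added.

1. Add $[\rightarrow a]$ to $T_{i-1,i}$ for $a = a_i$ and $[\rightarrow \beta] \in T_{j,i-1}$
where there are $A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in P^\dagger$ such that
$A \angle^* C$

2. Add $[\rightarrow \alpha a]$ to $T_{j,i}$ for $a = a_i$ and $[\rightarrow \alpha] \in T_{j,i-1}$
where there is $A \rightarrow \alpha a\beta \in P^\dagger$

3. Add $[\rightarrow A]$ to $T_{j,i}$ for $[\rightarrow \alpha] \in T_{j,i}$ and $[\rightarrow \beta] \in T_{h,j}$
where there are $A \rightarrow \alpha, D \rightarrow A\delta, B \rightarrow \beta C\gamma \in P^\dagger$
such that $D \angle^* C$

4. Add $[\rightarrow \beta A]$ to $T_{h,i}$ for $[\rightarrow \alpha] \in T_{j,i}$ and $[\rightarrow \beta] \in T_{h,j}$
where there are $A \rightarrow \alpha, B \rightarrow \beta A\gamma \in P^\dagger$

Report recognition of the input if $[\rightarrow S] \in T_{0,n}$.
For an example, see Figure 1.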
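
The four clauses of the tabular CP algorithm can be tried out with the following minimal sketch. It follows the clauses literally and simply iterates until no more items can be added; an item $[\rightarrow \alpha]$ is encoded as the tuple of symbols alpha. The encoding, the function names and the use of "^" as an ASCII stand-in for the up-arrow operator are assumptions of this sketch, and epsilon rules are assumed to be absent.

```python
from collections import defaultdict
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]

GRAMMAR: List[Rule] = [
    ("E", ("E", "+", "T")), ("E", ("T", "^", "E")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "**", "F")), ("T", ("F",)),
    ("F", ("a",)),
]


def left_corner_star(rules: Sequence[Rule]) -> Set[Tuple[str, str]]:
    """Pairs (B, A) such that B is a left corner* of A."""
    nts = {lhs for lhs, _ in rules}
    rel = {(n, n) for n in nts}
    rel |= {(rhs[0], lhs) for lhs, rhs in rules if rhs[0] in nts}
    changed = True
    while changed:
        changed = False
        for c, b in list(rel):
            for b2, a in list(rel):
                if b == b2 and (c, a) not in rel:
                    rel.add((c, a))
                    changed = True
    return rel


def tabular_cp(rules: Sequence[Rule], start: str, tokens: Sequence[str]) -> bool:
    aug = list(rules) + [("S'", (start,))]
    nts = {lhs for lhs, _ in aug}
    lc = left_corner_star(aug)
    prefixes = {rhs[:k] for _, rhs in aug for k in range(len(rhs) + 1)}
    lhs_of = defaultdict(set)        # complete rhs -> left-hand sides
    starts_with = defaultdict(set)   # first rhs symbol -> left-hand sides
    for lhs, rhs in aug:
        lhs_of[rhs].add(lhs)
        starts_with[rhs[0]].add(lhs)

    def filtered(x: str, beta: tuple) -> bool:
        """TD filter: some D -> x delta and B -> beta C gamma with D left corner* of C."""
        expected = {rhs[len(beta)] for _, rhs in aug
                    if len(rhs) > len(beta) and rhs[:len(beta)] == beta
                    and rhs[len(beta)] in nts}
        return any((d, c) in lc for d in starts_with[x] for c in expected)

    n = len(tokens)
    table = defaultdict(set)         # table[j, i] is the set T_{j,i}
    table[0, 0].add(())
    changed = True
    while changed:
        changed = False
        new = []                     # pairs ((row, column), item) to be added
        for i in range(1, n + 1):
            a = tokens[i - 1]
            for j in range(i):
                for beta in table[j, i - 1]:
                    if filtered(a, beta):                  # clause 1
                        new.append(((i - 1, i), (a,)))
                    if beta + (a,) in prefixes:            # clause 2
                        new.append(((j, i), beta + (a,)))
        for (j, i), items in list(table.items()):
            for alpha in items:
                for A in lhs_of.get(alpha, ()):
                    for h in range(j + 1):
                        for beta in list(table[h, j]):
                            if filtered(A, beta):          # clause 3
                                new.append(((j, i), (A,)))
                            if beta + (A,) in prefixes:    # clause 4
                                new.append(((h, i), beta + (A,)))
        for cell, item in new:
            if item not in table[cell]:
                table[cell].add(item)
                changed = True
    return (start,) in table[0, n]
```

The naive fixpoint iteration is chosen for brevity only; a practical implementation would compute the table column by column.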

Tabular CP parsing is related to a variant of CYK
parsing with TD filtering in [5]. A form of tabular
CP parsing without top-down filtering (i.e. without the
checks concerning the left-corner relation $\angle^*$) is the
main algorithm in [21].

4 Unless the grammar is cyclic, in which case the parser
may not terminate, both on correct and on incorrect input.

Figure 1: Tabular CP parsing. Consider again the grammar from
Example 1 and the (incorrect) input $a + a \uparrow a$. After execution
of the tabular common-prefix algorithm, the table is as given here.
The sets $T_{j,i}$ are given at the j-th row and i-th column.
The items which correspond with those from Example 4 are labelled
with (0), (1), ... These labels also indicate the order in which these
items are added to the table. [Table not reproduced.]

Without the use of top-down filtering, the references
to $[\rightarrow \beta]$ in Clauses 1 and 3 are clearly not of much use
any more. When we also remove the use of these items,
then these clauses become:

1. Add $[\rightarrow a]$ to $T_{i-1,i}$ for $a = a_i$
where there is $A \rightarrow a\alpha \in P^\dagger$

3. Add $[\rightarrow A]$ to $T_{j,i}$ for $[\rightarrow \alpha] \in T_{j,i}$
where there are $A \rightarrow \alpha, D \rightarrow A\delta \in P^\dagger$

In the resulting algorithm, no set $T_{i,j}$ depends on any
set $T_{g,h}$ with $g < i$. In [15] this fact is used to construct
a parallel parser with $n$ processors $P_0, \ldots, P_{n-1}$, with
each $P_i$ processing the sets $T_{i,j}$ for all $j > i$. The flow
of data is strictly from right to left, i.e. items computed
by $P_i$ are only passed on to $P_0, \ldots, P_{i-1}$.

Tabular ELR parsing 

The tabular form of ELR parsing allows an optimization
which constitutes an interesting example of how a
tabular algorithm can have a property not shared by its
nondeterministic origin.⁵

First note that we can compute the columns of a
parse table strictly from left to right, that is, for fixed $i$
we can compute all sets $T_{j,i}$ before we compute the sets
$T_{j,i+1}$.

If we formulate a tabular ELR algorithm in a naive
way analogously to Algorithm 5, as is done in [5], then
for example the first clause is given by:

1. Add $[\Delta' \rightarrow a]$ to $T_{i-1,i}$ for $a = a_i$ and
$[\Delta \rightarrow \beta] \in T_{j,i-1}$
where $\Delta' = \{A \mid \exists A \rightarrow a\alpha, B \rightarrow \beta C\gamma \in P^\dagger [B \in
\Delta \wedge A \angle^* C]\}$ is non-empty



5 This is reminiscent of the admissibility tests [3], which 
are applicable to tabular realisations of logical push-down 
automata, but not to these automata themselves. 




However, for certain $i$ there may be many $[\Delta \rightarrow \beta] \in T_{j,i-1}$,
for some $j$, and each may give rise to a different
$\Delta'$ which is non-empty. In this way, Clause 1 may add
several items $[\Delta' \rightarrow a]$ to $T_{i-1,i}$, some possibly with
overlapping sets $\Delta'$. Since items represent computation
of subderivations, the algorithm may therefore compute
the same subderivation several times.

We propose an optimization which makes use of the
fact that all possible items $[\Delta \rightarrow \beta] \in T_{j,i-1}$ are already
present when we compute items in $T_{i-1,i}$: we compute
one single item $[\Delta' \rightarrow a]$, where $\Delta'$ is a large set computed
using all $[\Delta \rightarrow \beta] \in T_{j,i-1}$, for any $j$. A similar
optimization can be made for the third clause.

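Algorithm 6 (Tabular extended LR)

Sets $T_{i,j}$ of the table are to be subsets of $I^{ELR}$. Start
with an empty table. Add $[\{S'\} \rightarrow\ ]$ to $T_{0,0}$. For
$i = 1, \ldots, n$, in this order, perform one of the following
steps until no more items can be added.

1. Add $[\Delta' \rightarrow a]$ to $T_{i-1,i}$ for $a = a_i$
where $\Delta' = \{A \mid \exists j \exists [\Delta \rightarrow \beta] \in T_{j,i-1} \exists A \rightarrow a\alpha, B \rightarrow
\beta C\gamma \in P^\dagger [B \in \Delta \wedge A \angle^* C]\}$ is non-empty

2. Add $[\Delta' \rightarrow \alpha a]$ to $T_{j,i}$ for $a = a_i$ and
$[\Delta \rightarrow \alpha] \in T_{j,i-1}$
where $\Delta' = \{A \in \Delta \mid A \rightarrow \alpha a\beta \in P^\dagger\}$ is non-empty

3. Add $[\Delta'' \rightarrow A]$ to $T_{j,i}$ for $[\Delta' \rightarrow \alpha] \in T_{j,i}$
where there is $A \rightarrow \alpha \in P^\dagger$ with $A \in \Delta'$, and $\Delta'' =
\{D \mid \exists h \exists [\Delta \rightarrow \beta] \in T_{h,j} \exists D \rightarrow A\delta, B \rightarrow \beta C\gamma \in
P^\dagger [B \in \Delta \wedge D \angle^* C]\}$ is non-empty

4. Add $[\Delta'' \rightarrow \beta A]$ to $T_{h,i}$ for $[\Delta' \rightarrow \alpha] \in T_{j,i}$ and
$[\Delta \rightarrow \beta] \in T_{h,j}$
where there is $A \rightarrow \alpha \in P^\dagger$ with $A \in \Delta'$, and $\Delta'' =
\{B \in \Delta \mid B \rightarrow \beta A\gamma \in P^\dagger\}$ is non-empty

Report recognition of the input if $[\{S'\} \rightarrow S] \in T_{0,n}$.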

Informally, the top-down filtering in the first and
third clauses is realised by investigating all left corners
$D$ of nonterminals $C$ (i.e. $D \angle^* C$) which are expected
from a certain input position. For input position $i$ these
nonterminals $D$ are given by

$S_i = \{D \mid \exists j \exists [\Delta \rightarrow \beta] \in T_{j,i} \exists B \rightarrow \beta C\gamma \in P^\dagger [B \in \Delta \wedge D \angle^* C]\}$

Provided each set $S_i$ is computed just after completion
of the i-th column of the table, the first and third
clauses can be simplified to:

1. Add $[\Delta' \rightarrow a]$ to $T_{i-1,i}$ for $a = a_i$
where $\Delta' = \{A \mid A \rightarrow a\alpha \in P^\dagger\} \cap S_{i-1}$ is non-empty

3. Add $[\Delta'' \rightarrow A]$ to $T_{j,i}$ for $[\Delta' \rightarrow \alpha] \in T_{j,i}$
where there is $A \rightarrow \alpha \in P^\dagger$ with $A \in \Delta'$, and $\Delta'' =
\{D \mid D \rightarrow A\delta \in P^\dagger\} \cap S_j$ is non-empty

which may lead to more practical implementations.
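
A column-by-column realisation of the tabular ELR algorithm with the sets $S_i$ can be sketched as follows. This is an illustration only, not the implementation of [8]: an item $[\Delta \rightarrow \alpha]$ is encoded as a pair (frozenset of nonterminals, tuple of symbols), the helper names are ours, "^" is an ASCII stand-in for the up-arrow operator, and epsilon rules are assumed to be absent.

```python
from collections import defaultdict
from typing import List, Sequence, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]

GRAMMAR: List[Rule] = [
    ("E", ("E", "+", "T")), ("E", ("T", "^", "E")), ("E", ("T",)),
    ("T", ("T", "*", "F")), ("T", ("T", "**", "F")), ("T", ("F",)),
    ("F", ("a",)),
]


def left_corner_star(rules: Sequence[Rule]) -> Set[Tuple[str, str]]:
    """Pairs (B, A) such that B is a left corner* of A."""
    nts = {lhs for lhs, _ in rules}
    rel = {(n, n) for n in nts}
    rel |= {(rhs[0], lhs) for lhs, rhs in rules if rhs[0] in nts}
    changed = True
    while changed:
        changed = False
        for c, b in list(rel):
            for b2, a in list(rel):
                if b == b2 and (c, a) not in rel:
                    rel.add((c, a))
                    changed = True
    return rel


def tabular_elr(rules: Sequence[Rule], start: str, tokens: Sequence[str]):
    """Return (accepted, table), where table[j, i] is the set T_{j,i}."""
    aug = list(rules) + [("S'", (start,))]
    nts = {lhs for lhs, _ in aug}
    lc = left_corner_star(aug)
    n = len(tokens)
    table = defaultdict(set)
    table[0, 0].add((frozenset({"S'"}), ()))

    def lhs_with_prefix(prefix: tuple, allowed) -> frozenset:
        """Left-hand sides in `allowed` of rules whose rhs starts with `prefix`."""
        return frozenset(a for a, rhs in aug
                         if a in allowed and rhs[:len(prefix)] == prefix)

    def compute_s(i: int) -> set:
        """The set S_i of left corners of nonterminals expected at position i."""
        expected = set()
        for j in range(i + 1):
            for delta, beta in table[j, i]:
                for b, rhs in aug:
                    if b in delta and len(rhs) > len(beta) and rhs[:len(beta)] == beta:
                        expected.add(rhs[len(beta)])
        return {d for d in nts for c in expected if (d, c) in lc}

    s = {0: compute_s(0)}
    for i in range(1, n + 1):
        a = tokens[i - 1]
        new = lhs_with_prefix((a,), s[i - 1])                    # clause 1
        if new:
            table[i - 1, i].add((new, (a,)))
        for j in range(i):                                       # clause 2
            for delta, alpha in list(table[j, i - 1]):
                new = lhs_with_prefix(alpha + (a,), delta)
                if new:
                    table[j, i].add((new, alpha + (a,)))
        changed = True
        while changed:
            changed = False
            for j in range(i):
                for delta1, alpha in list(table[j, i]):
                    for A in {l for l, rhs in aug if rhs == alpha and l in delta1}:
                        found = []
                        new = lhs_with_prefix((A,), s[j])        # clause 3
                        if new:
                            found.append((j, (new, (A,))))
                        for h in range(j + 1):                   # clause 4
                            for delta, beta in list(table[h, j]):
                                new = lhs_with_prefix(beta + (A,), delta)
                                if new:
                                    found.append((h, (new, beta + (A,))))
                        for h, item in found:
                            if item not in table[h, i]:
                                table[h, i].add(item)
                                changed = True
        s[i] = compute_s(i)                                      # column i is complete
    return (frozenset({"S'"}), (start,)) in table[0, n], table
```

For the input a * a of Example 3, the sketch places the item $[\{T, E\} \rightarrow T]$ in $T_{0,1}$, in agreement with the stack element discussed in that example.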

Note that we may have that the tabular ELR algorithm
manipulates items of the form $[\Delta \rightarrow \alpha]$ which
would not occur in any search path of the nondeterministic
ELR algorithm, because in general such a $\Delta$
is the union of many sets $\Delta'$ of items $[\Delta' \rightarrow \alpha]$ which
would be manipulated at the same input position by the
nondeterministic algorithm in different search paths.

With minor differences, the above tabular ELR algo- 
rithm is described in [21]. A tabular version of pseudo 
ELR parsing is presented in [20]. Some useful data 
structures for practical implementation of tabular and 
non-tabular PLR, ELR and CP parsing are described 
in [8]. 

Finding an optimal tabular algorithm 

In [14] Schabes derives the LC algorithm from LR pars- 
ing similar to the way that ELR parsing can be derived 
from LR parsing. The LC algorithm is obtained by not 
only splitting up the goto function into $goto_1$ and $goto_2$
but also splitting up $goto_2$ even further, so that it nondeterministically
yields the closure of one single kernel
item. (This idea was described earlier in [5], and more
recently in [10].)

Schabes then argues that the LC algorithm can be 
determinized (i.e. made more deterministic) by manip- 
ulating the goto functions. One application of this idea 
is to take a fixed grammar and choose different goto 
functions for different parts of the grammar, in order 
to tune the parser to the grammar. 

In this section we discuss a different application of 
this idea: we consider various goto functions which are 
global, i.e. which are the same for all parts of a grammar. 
One example is ELR parsing, as its $goto_2$ function can
be seen as a determinized version of the $goto_2$ function
of LC parsing. In a similar way we obtain PLR parsing.
Traditional LR parsing is obtained by taking the full
determinization, i.e. by taking the normal goto function
which is not split up.⁶

6 Schabes more or less also argues that LC itself can be 
obtained by determinizing TD parsing. (In lieu of TD pars- 
ing he mentions Earley's algorithm, which is its tabular 
realisation.) 



We conclude that we have a family consisting of LC,
PLR, ELR, and LR parsing, which are increasingly deterministic.
In general, the more deterministic an algorithm
is, the more parser states it requires. For example,
the LC algorithm requires a number of states (the
items in $I^{LC}$) which is linear in the size of the grammar.
By contrast, the LR algorithm requires a number
of states (the sets of items) which is exponential in the
size of the grammar [2].

The differences in the number of states complicate
the choice of a tabular algorithm as the one giving optimal
behaviour for all grammars. If a grammar is very
simple, then a sophisticated algorithm such as LR may 
allow completely deterministic parsing, which requires a 
linear number of entries to be added to the parse table, 
measured in the size of the grammar. 

If, on the other hand, the grammar is very ambigu- 
ous such that even LR parsing is very nondeterministic, 
then the tabular realisation may at worst add each state
to each set $T_{i,j}$, so that the more states there are, the
more work the parser needs to do. This favours simple
algorithms such as LC over more sophisticated ones
such as LR. Furthermore, if more than one state represents
the same subderivation, then computation of that
subderivation may be done more than once, which leads
to parse forests (compact representations of collections
of parse trees) which are not optimally dense [1, 12, 7].

Schabes proposes to tune a parser to a grammar, or
in other words, to use a combination of parsing techniques
in order to find an optimal parser for a certain
grammar.⁷ This idea has until now not been realised.
However, when we try to find a single parsing algorithm 
which performs well for all grammars, then the tabu- 
lar ELR algorithm we have presented may be a serious 
candidate, for the following reasons: 

• For all $i$, $j$, and $\alpha$ at most one item of the form
$[\Delta \rightarrow \alpha]$ is added to $T_{i,j}$. Therefore, identical subderivations
are not computed more than once. (This
is a consequence of our optimization in Algorithm 6.)
Note that this also holds for the tabular CP algorithm.

• ELR parsing guarantees the correct-prefix property,
contrary to the CP algorithm. This prevents computation
of all subderivations which are useless with
regard to the already processed input.

• ELR parsing is more deterministic than LC and PLR
parsing, because it allows shared processing of all
common prefixes. It is hard to imagine a practical
parsing technique more deterministic than ELR parsing
which also satisfies the previous two properties.
In particular, we argue in [8] that refinement of the
LR technique in such a way that the first property
above holds would require an impractically large
number of LR states.



7 This is reminiscent of the idea of "optimal cover" [5].



Epsilon rules 

Epsilon rules cause two problems for bottom-up pars- 
ing. The first is non-termination for simple realisations 
of nondeterminism (such as backtrack parsing) caused 
by hidden left recursion [7]. The second problem occurs
when we optimize TD filtering e.g. using the sets $S_i$: it
is no longer possible to completely construct a set $S_i$ before
it is used, because the computation of a derivation
deriving the empty string requires $S_i$ for TD filtering
but at the same time its result causes new elements to
be added to $S_i$. Both problems can be overcome [8].

Conclusions 

We have discussed a range of different parsing algo- 
rithms, which have their roots in compiler construction, 
expression parsing, and natural language processing. 
We have shown that these algorithms can be described 
in a common framework. 

We further discussed tabular realisations of these al- 
gorithms, and concluded that we have found an opti- 
mal algorithm, which in most cases leads to parse tables 
containing fewer entries than for other algorithms, but 
which avoids computing identical subderivations more
than once. 